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Abstract: Comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GCXGC/TOF- 
MS) has been applied to metabolomics analyses recently* However, retention time shifts in the two-dimensional gas 
chromatography will introduce difficulty to compare compound profiles obtained from multiple samples. In this work, a novel 
two-stage peak alignment algorithm has been developed for data analysis of GCXGC/TOF-MS. In the first stage, our algorithm 
detects and merges multiple peak entries of the same metabolite into one peak entry* After a z-score transformation of metabolite 
retention times, landmark peaks will be selected from all samples based on both two-dimensional retention times and mass 
spectrum similarity of fragment ions measured by Pearson's correlation coefficient* In the second stage, the original two- 
dimensional retention time shift will be corrected using a local linear fitting method. A progressive retention time map searching 
method is used to align peaks in all samples together based on the parameters optimized in the first stage. Our algorithm can avoid 
defining a threshold of retention time window and spectrum similarity, which is very difficult for the users since the experimental 
condition is always changed in different experimental runs, even for the repeat experiments. The experimental results show that our 
algorithm can work well in peak alignment from real biological samples, which is very important for the further analysis. 
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L Introduction 

An emerging technology, comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry 
(GCXGC/TOF-MS), brings much increased signal-to-noise ratio, dynamic range, chemical selectivity, and sensitivity to 
metabolomics analyses [1-4]. This approach uses a multidimensional separation technique, where a short column after the main 
analytical column, to separate as many compounds as possible [5,6]. The orthogonal setup of two columns in separation part makes 
GCXGC/TOF-MS platform get an order- of-magnitude increase in separation capacity, which is very important for the analysis of 
many complex samples. 

After analyzing these samples using GCXGC/TOF-MS, it is necessary to recognize molecular features of the same compound 
occurring in different samples from each of the raw instrument data [7,8]. Ideally, the same compound should have the identical 
retention times in the two-dimensional GC if the instrument configuration is the same. However, retention times always shift in 
both GC dimensions as a result of several, sometimes uncontrollable factors such as temperature and pressure fluctuations, matrix 
effects on samples, and stationary phase degradation. Retention time shifts introduce difficulty to compare compound profiles 
obtained from multiple samples. Therefore, aligning the instrumental signals which generated from same compounds in different 
samples, i.e., peak alignment, have to consider the retention time variation [9]. 

Currently, four studies addressed alignment issue for the two-dimensional GC separations using the raw instrument data as input 
material. Fraga et al. developed a rank-based algorithm using the generalized rank annihilation method (GRAM) to correct retention 
time variations in the two- dimensional GC [10,11] . Mispelaar et al. developed a correlation- optimized shifting-based algorithm to 
align a local region of a GCXGC chromatogram [12]. These two methods can only be used to align small regions of interest in the 
two-dimensional GC data set. To correct the entire chromatogram in both GC dimensions, Pierce et al. proposed an indexing 
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scheme together with a piecewise retention time alignment algorithm [13]. Zhang et aL developed a two dimensional correlation 
optimized warping (2-D COW) method by extending the correlation optimized warping method from I-D to 2-D [14,15]. 
However, these methods align the GCxGC/TOF-MS data based on two-dimensional retention times alone, even though the 
signature feature of a metabolite, i.e., mass spectrum of fragment ions, is readily available in the raw instrument data. Aligning 
metabolite peaks solely based on the two dimensional retention times may introduce a high rate of false alignment because some 
metabolites with similar chemical functional groups have similar retention times in both GC dimensions. For this reason, two peak 
alignment methods, MSort [16] and DISCO [17], were developed. In these two methods, the raw instrument data are first subjected 
for spectrum deconvolution to generate a list of metabolite peaks for each sample, of which each metabolite peak is characterized by 
multiple molecular features including retention times in the two-dimensional GC, peak area, fragment spectrum, and other associated 
features. Both of MSort and DISCO employ two dimensional retention times and the mass spectrum of compound fragment ions 
for peak alignment. MSort was designed to align homogeneous data, while DISCO can align homogeneous and heterogeneous data. 
These two methods greatly reduced the rate of false alignment compared to existing alignment approaches. However, MSort and 
DISCO softwares use a user-defined retention time window with a fixed size in the two retention time dimensions to filter the peak 
candidates first, then the peak pair with the highest similarity will be choose as corresponding peaks in the different experiments if 
its value is bigger than a user-defined threshold. The separated application of retention time distance and spectrum similarity increase 
false alignment since there is no a golden criteria to select the distance window and the spectrum similarity threshold. 

To overcome the limitations of current alignment algorithms, this paper reports a novel alignment algorithm for GCXGC/TOF- 
MS which can consider the distance and spectrum similarity together and automatically set up the threshold values. After the peak 
lists are generated from instrumental software, the present algorithm is implemented using a two -stage strategy. In the first stage, 
landmark peaks, a set of compound peaks present in each biological sample, are selected from the entire original peak lists using a 
mixture similarity comprised of the Euclidean distances of two-dimensional retention times and mass spectrum similarity of two 
corresponding peaks. In the second stage, retention time shifts are corrected using a local partial linear fitting method to handle non- 
linear retention time distortion, and the compound peaks of all samples then are aligned using a progressive retention time map 
searching method. 
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Figure 1. Workflow of the two-stage peak alignment algorithm. 



Volume No: 7, Issue: 8, April 2013, e20 1 304002 



Computational and Structural Biotechnology Journal | www.csbj.org 



Two-stage Peak Alignment Algorithm far MetabalDmics 



2. Methods 

For each experimental run, ChromaTOF will generate a peak list analyzed by instrumental software under predefined parameters. 
For all of the peak lists, our algorithm will align them using a two-stage peak alignment algorithm: stage one is full alignment which 
will find the peaks present in all lists and align them together; stage two is partial alignment which will align all remained peaks from 
stage one based on the results of full alignment* The workflow can be described as Figure L 

In our method, a mixture similarity measurement is adopted to judge two peaks from different peak list are from the same 
compound or not. This mixture similarity m ^ can be calculated as follows: 

mS = wx (1 - (d I d mdx ) 2 ) + (1 - w) x Sim (1) 

where w is contribution factor, d is the Euclidean distance of two peaks in two-dimensional retention time space, ^ max is the 
maximum distance for all peaks within all samples, and Sim [ s Pearson's correlation value of spectra of two peaks. 

The alignment algorithm is implemented using a two -stage strategy: full alignment and partial alignment. In the full alignment, 

the peaks presented in all peak lists will be found and aligned together. Meanwhile, the parameter w and ^ max in the formula (I) can 
be obtained for the partial alignment. 



2. 1 Full alignment 
2*LI Peak merging 

Assuming there is a set of peak lists: S{s l ,s 2 X> ,s N )> our algorithm will merge the peaks which come from the same compound 
in each peak list. Ideally, all instrument signals generated by one type of metabolite should be reported as a single peak in the output 
file of ChromaTOF software, i.e., one peak entry in the metabolite peak list. However, multiple peak entries of the same metabolite 
can be reported due to the abnormal metabolite peak shape and/ or the high sensitivity of the peak detection algorithm. Therefore, 
the multiple entries should be merged before peak alignment. 

1) Initialize a user-defined retention time window RTIw, RT2w, and a user-defined mass spectrum similarity^ 0 , set z " ~^J~^; 

2) For the * peak A'( r ^-i' r ^i) in the J peak list Sj , if the * peak is the last peak in the peak list Sj , set * — 1» ./ — 7" + 1 w hen J < ^ 

. yy pfinal 

, or stop when J ~ , otherwise our algorithm will store this peak into a peak set candiate anc J extract all peaks within 

prti (rt2 rt2 + RTTm?) P rtl P rtl . _ . 1 

3) If candiate is not empty, extracts all peaks within v <' l ' } as candiate within candiate , else set 1 ~ 1 + i go to step 2); 

prtl prtl 

4) If candiate { s not empty, the spectrum similarity between ^ l and all the peaks in candiate W {JJ k e calculated. If there are some peaks 

have similarity values which are bigger ^° , those peaks can be seen as multiple entries candidate Candiate Q f Pi f e J se set i = i + \ g Q to 
step 2); 

P P pfinal , j _ . 1 . nj^j pfinal 

5) If candiate is not empty, store candiate into candiate , let z ' 1 and go to step 2), else if the number of peaks in candiate is 

pfinal . S 

bigger than I, our algorithm will extract all the peaks in candiate from the peak list ; , then merge them together as a new peak within 
Sj and set its peak area and retention time as: 

(2) 
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A "» w (3) 

A A 7 

where p « denotes peak area of the representative peak, Pi is peak area of the ith multiple entry to be merged, ^ is the index number 

RT RT 

of multiple peak entries to be merged, p » denotes retention time of the representative peak, Pi is the retention time of the ith 

multiple entry to be merged. Here, the retention time is either the first dimension retention time or the second dimension retention 
time. 



6) Set — + 1 , and go to step 2); 



2.1*2 Z-score transformation 

For typical GCXGC/TOF-MS configuration, the second GC column is much shorter than the first column, which will result in 
the second retention time is smaller than the first retention time obviously. To balance the contribution of the two dimensional 
retention times, the retention time values in both the first and the second dimension GC are therefore transformed into z -scores as 
following: 

RT X -RT RT 2 -RT 2u 
RT lz = — l - ^ RT 2z = — ^ 

RT la f " RT 2a (4) 

RT RT RT 

where lz is the z-score value after transformation, 1 is the original value of the first dimension retention time, lM is the mean 

RT 

value of the original first dimension retention times within a peak list, 1(7 is the standard deviation of the original first dimension 
retention times. Accordingly, the symbols in the second dimension retention time have similar meanings. For all the peaks in all peak 

lists, the pairwise distances will be calculated and the maximum value is chosen as ^ max . 
2J.3 Landmark peak finding 

The purpose of full alignment is to find a list of landmark peaks which present in all peak lists. Therefore, one of peak lists has 
been selected randomly from sample set ^^ s i ,s 2^ > s n) as reference sample Sr , and the remain are seen as target sample set 
S {s t }{i 1,2, A ,N 1) Yhen landmark peaks can be found as follows: 

1) Selecting a sample randomly as the target sample T from the set of remaining samples S , then S =l*,}(i = U ,N-ty 

2) For each peak P r in the reference sample Sr , our algorithm will extract the peaks with same name from the target sample St as a 

psameName 

candidate peak set candidate f anc J choose the one with the minimum distance from P r in z-score transformation space of retention 
time as the corresponding peak P % of Pr in St . 

3) Storing the peak pair ( Pr , P* ) into a set of landmark peaks P P . 

4) Updating the reference sample Sr using the peaks which has found the corresponding peak, and if the number of sample in $ is 
not one, go to step I), else stop. 

After landmark peaks set P P has been found, we can get an optimized value of w from a candidate value set 

(0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95). For each sample pair ^'^) w here Sr is the reference sample and s t (' = 1 > 2 > A .W" 1 ) 
from the remained samples, our algorithm calculate the mixture similarity m ^ m of the each landmark peak using the formula (I), 
and get a summary value m ^ for each sample pair. And choose the w value with the largest summary value of m ^ as the optimized 
value for each sample pair. 
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2.2 Partial alignment 

The partial alignment is to align the remained peaks which are not the landmark peaks selected the full alignment* After full 

alignment, we already get the value of ^ max and w in formula (I) and a set of landmark peaks which present in all the samples. 
However the remained peaks also should be aligned together even they present in part of samples. Our algorithm will use an outlier 
detection method to decide the threshold of distance and similarity automatically, and implement partial alignment based on those 
threshold values. 

2*2*1 Outlier detection 

Our algorithm considers the landmark peaks as true peak alignment, therefore the information from the landmark peaks can be 
used to help the partial alignment because the remained peaks are detected by the same experimental conditions and the same data 

processing procedure. Here we select the maximum peak distance the minimum similarity ^^ m ° and the minimum mixture 

similarity as threshold to judge the remained peaks from different sample come from same compound or not. 

However, because the complexity of sample source and some inevitable variation from experiment, the data points will be further 
away from the mean value of the above three parameters than what is deemed reasonable. The simple selection of the extreme value 
of parameters as the threshold will cause faulty alignment of peaks. Therefore, our algorithm remove the outliers using Grubbs' test 

method from the above three parameters, and set the threshold values ^° , and after outlier removal. 
2*2*2 Peak alignment 

Our algorithm aligns the remained peaks in a pair wise way, i.e. ( s r> s t) where Sr is the reference sample and St ^ ,N — 1) 

from the remained samples. For each sample pair ^ ' 5 r ) , the remained peaks are aligned as follows: 

1) For each peak ^ z m SR , the peak ^ z in s t , if they satisfy the criteria that ^(PiiPj) ^^^nd ^i m (Pi> Pj) > Si m o ^ t j 1£ p ea k p a { r 

(p.p.). p l 

' j will be selected as a candidate peak pair and store into a candidate set candidate t 

2) Within candidate f a U peak pairs have the same name which assigned by ChormaTOF software will be extracted as a subgroup 

psameName ~p sameName 

candidate f anc J assign the remained peak pairs as another subgroup candidate t 

psameName 

3) For each compound name, we account the times present within candidate t {f f t j ust presents in one peak pair, this peak pair will be 

P l 

seen as the same compound and transfer this pair into aligned peak pair set all s ned * if it presents in more than one peak pair, the peak 

pi psameName 

pair with the largest m ^ value will be transferred into aligned , and remove all pairs in candidate consists of either peak of this peak 

psameName 

pair. Repeat this step until candidate [ s empty. 

p sameName p 1 

4) Removing all peak pairs in candidate w i tn tne same p ea k w ithin aligned . 

p sameName psameName p l 

5) If candidate { s not empty, our algorithm will transfer the peak pair with the largest m ^ value from candidate [ nto aligned ^ anc j 

p sameName p sameName 

remove all pairs in candidate consists of either peak of this peak pair. Repeat this step until candidate [ s empty. 



6) The above steps I)-5) will be repeated until all sample pair^'^) ^ 1>2,A ,N 1) are a l£g nec J anc J g^^atigned are combined into 

P 

partial alignment result allgned . 

7) After partial alignment, our align algorithm will be finished with combination of full alignment result and partial alignment 

p 

result aligned . 
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3. Results and discussion 

3.1 experimental data 

A mixture of 76 compounds (8270 MegaMix, Restek Corp., Bellefonte, PA) and C7-C40 saturated n-alkanes (Sigma-Aldrich 
Corp., St. Louis, MO) were spiked with a deuterated six component semi-volatiles internal standard (ISTD) mixture (Restek Corp., 
Bellefonte, PA) at a concentration of 2.5 jug/ mL prior to GCxGC/TOF-MS analysis. 

The performance of our algorithm was tested by aligning three datasets, which are replicate analyses of the same sample using an 
identical two dimensional GC configuration with different column temperature gradients, i.e., 5 °C /min, 7 °C/min, and 10 °C 
/ min, respectively. To evaluate the performance of our proposed algorithm more objectively, the experiments ramped at different 
temperature gradients have been repeated different times, i.e., 10 replicate analyses for 5 °C /min, 3 replicate analyses for 7 °C/min, 
and 4 replicate analyses for 10 °C/min. The design of the different repeat times of different experimental conditions will help us to 
evaluate how robustness our algorithm is. The experiments can be found in our previous work in more details [17]. 

3.2 performance measurements 

Here the algorithm performance is evaluated by three measurements, i.e., true positive rate (TPR), positive predictive value 
(PPV), and FI score of the peak alignment. For the sample set ^^ s ^ s ^ > s n) t if a compound presents in all N samples, it will be 
called as a positive peak pair. After peak alignment, if the number of positive peak pairs is ^ p and the number of matched peak pairs 
is ^ m , then the values of TPR, PPV and FI score can be calculated as follows: 

TPR = TP /(TP + FN) 

PPV = TP I (TP + FP) 

Fl = 2- TPR • PPV /(TPR + PPV ) 

where TP is the number of positive peak pairs that were aligned as positive (true positive), FP is the number of negative peak pairs 
that were aligned as positive (false positive) and is ^ m -TP, FN is the number of positive peak pairs that were not aligned (false 
negative) and is ^ p -TP. TPR is also called recall and PPV precision and FI score is their harmonic mean. 



3.3 Parameters finding 

One of advantages of our algorithm proposed here is there are no predefined parameters when the peak alignment is 
implemented. The method can automatically decide the threshold of Pearson's correlation of two spectra, the distance window in z- 
score transformation retention time space and the mixture similarity value. However, the complexity of samples and some inevitable 
variations of experiments may make the distributions of these parameters not normal. Our algorithm decides the thresholds based on 
the landmark peaks, which are presented all the samples. Except the information of retention time and spectrum similarity, our 
algorithm also uses the peak name assigned by the ChromaTOF software. After the full alignment, the distribution of parameters: 
Eculid distance, spectrum similarity m ^ , and mixture similarity ^Sim can b e found in Figure 2. It can be seen that the each 
distribution of three parameters has an apparent tail. Therefore, the simple selection of maximum or minimum of these parameters 
may cause some inaccurate setting. Our algorithm uses an outlier detection method to remove the observations which are distant 
from the rest of the parameters, which makes the selections more reasonable. 

3.4 Alignment performance 

Our algorithm is tested using the experimental data which are generated under different temperature gradient setup, i.e., 5 
°C/min, 7 °C/min, and 10 °C/min. Therefore, 17 peak lists can be obtained from ChromaTOF where each peak list means one 
sample analyzed by GCXGC/TOF-MS, and they can be denoted as S5(s*° c ,s 5 2 ° c ,L ,s 5 m c ), Sl(sl° c ,s^ c ,s]° c ) , and 
Sl0(sl°° c ,s l 2 °° c ,s l 3 °° c ,s\°° c ) f° r tne different temperature gradients respectively. After peak alignment, the overall performance can be 
found in Figure 3. 
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Mixture Similarity 

Figure 2. The distribution of Pearson's correlation, Euclid distance and mixture similarity of peak pairs. 

Obviously, the performance will be changed along with the sample size because the more samples will be aligned, the fewer peaks 

will present in all samples, and the fewer peaks will be aligned together. The number of positive peak pairs ^ p is 72 for the 7, 67 

for 10, 66 for 5 and 63 for all degree gradients experiments. The number of matched peak pairs ^ m is 64, 64, 57 and 43 for 7, 10, 
5 and all degree gradients experiments. 




S5 S7 S10 S5+S7+S10 S5 S7 S10 S5+S7+S10 

Sample sets Sample sets 

Figure 3. The overall performance of peak alignment. 
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For Figure 3, we can also found that the PPV values our algorithm achieved in all different data sets are 1.0, which means false 
positive is zero based on the above definition of PPV* Our alignment algorithm can make sure the peaks what we aligned together 
are come from the same compound, therefore it can solve the false positive problem which is the most limitation of current peak 
alignment methods. 

Conclusions 

This paper proposed a new two-stage peak alignment algorithm, which is consist of full alignment and partial alignment, to align 
the GCXGC/TOF-MS data. The present algorithm uses a set of peak lists which is output of the instrument control software, 
ChromaTOF, as its input data. In the full alignment, multiple peak entries of the same compound were detected firstly and merged 
into one peak. A z-score transformation was applied to balance the contribution of the first and second dimensional retention times 
in the peak distance calculation. Landmark peaks then were found based on the information from two-dimensional retention times, 
the mass spectrum similarity and the assigned compound name by ChromaTOF. In the partial alignment, our algorithm can 
automatically set up the threshold of peak distance, mass spectrum similarity and mixture similarity value to recognize the same 
compound from different experiments. A progressive retention time map searching method then is used to align metabolite peaks in 
all samples together based on the landmark peaks found in the full alignment. 

Our algorithm can avoid a user-defined threshold of retention time window in the first and second dimension, as well as a 
threshold of spectrum similarity, which is very difficult task for the users since the experimental condition is always changed in 
different experimental runs, even for the repeat experiments. This kind of design can avoid the problems caused by improper 
parameter selection and inconsistency among samples. Another advantage of this work is taking full advantage of the information 
generated by experimental instrumental software, i.e. the two dimensional GC retention times, fragment ion spectrum correlation and 
database information which implicated by the compound name assigned by ChromaTOF. 

The performance of the present algorithm was tested by experimental data where the samples were analyzed under different 
experiment conditions. The results show that our algorithm can work well in peak alignment from real biological samples. As a 
critical step of data pre-processing, the alignment results achieved by our propose methods can be used effectively for further analysis 
such as pattern recognition and statistical significance testing. 
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